TABLE 5.6
Quantization results of BEBERT on the GLUE benchmark. The average
results over all tasks are reported.

Method          #Bits (W-E-A)   Size (MB)   GLUE (avg.)
BERT-base       full-prec.      418         82.84
DynaBERT        full-prec.      33          77.36
DistilBERT6L    full-prec.      264         78.56
BinaryBERT      1-1-4           16.5        78.76
BEBERT          1-1-4           33          80.96
TinyBERT6L      full-prec.      264         81.91
TernaryBERT     2-2-8           28          81.91
BinaryBERT      1-1-4           16.5        81.57
BEBERT          1-1-4           33          82.53
Inspired by the empirical observation in [3] that ensemble learning adds little accuracy to
convolutional neural networks once KD has already been applied, the authors removed KD
during ensembling to accelerate the training of BEBERT. Although the two-stage KD of [106]
performs better, it is time-consuming because forward and backward propagation must be run
twice. Ensembling with prediction-only KD avoids this double propagation, and ensembling
without KD further removes the teacher-model evaluation altogether. Experiments on the
GLUE datasets confirm that dropping KD when ensembling BinaryBERT has only a minor
effect on accuracy, so BEBERT without KD saves training time while preserving accuracy.
The authors also compared BEBERT with various SOTA compressed BERTs. The results
listed in Table 5.6 show that BEBERT outperforms BinaryBERT in accuracy by up to 6.7%.
Compared with the full-precision BERT, it also saves 15× in FLOPs and 13× in model size,
with a negligible accuracy loss of 0.3%, showing its potential for practical deployment.
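To make the prediction-level ensemble concrete, the following Python sketch averages the logits of several independently fine-tuned binarized BERT classifiers. It is a minimal illustration under assumed interfaces (a `models` list of classifiers that return logits), not the authors' released implementation.

```python
import torch

def ensemble_predict(models, input_ids, attention_mask):
    """Average the logits of several independently trained binarized BERT
    classifiers (hypothetical `models` list). Ensembling only at the
    prediction level means each member is fine-tuned once and no extra
    teacher forward/backward pass is needed when forming the ensemble."""
    member_logits = []
    with torch.no_grad():
        for model in models:
            model.eval()
            out = model(input_ids=input_ids, attention_mask=attention_mask)
            # Support both raw-tensor and HuggingFace-style outputs.
            member_logits.append(out.logits if hasattr(out, "logits") else out)
    # Uniform averaging; weighted (e.g., boosting-style) combinations are
    # equally possible.
    return torch.stack(member_logits, dim=0).mean(dim=0)
```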
In summary, the paper's contributions are: (1) it is the first work to introduce ensemble
learning to binary BERT models to improve accuracy and robustness; (2) removing the KD
procedures during ensembling accelerates the training process.
5.9 BiBERT: Accurate Fully Binarized BERT
Although BinaryBERT [6] and BEBERT [222] push the weights and word embeddings down
to binary, they do not manage to binarize BERT with 1-bit activations accurately. To mitigate
this, Qin et al. [195] proposed BiBERT, moving toward fully binarized BERT models. BiBERT
includes an efficient Bi-Attention structure that statistically maximizes representation
information and a Direction-Matching Distillation (DMD) scheme to optimize the fully
binarized BERT accurately.
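As background for what "1-bit activations" entail, the sketch below shows the generic sign-based binarizer with a straight-through estimator (STE) that fully binarized BERTs typically build on; it is an illustrative assumption, not BiBERT's exact quantization scheme.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign-based 1-bit quantizer: the forward pass maps values to {-1, +1},
    and the backward pass uses the straight-through estimator, passing
    gradients through for inputs in [-1, 1] and clipping them elsewhere."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # Ties at zero are mapped to +1 to avoid a third output value.
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Standard clipped STE: zero gradient where |x| > 1.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


def binarize(x):
    # Applied to weights, embeddings, and activations in a fully binarized layer.
    return BinarizeSTE.apply(x)
```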
5.9.1 Bi-Attention
To address the information degradation of binarized representations during forward prop-
agation, the authors proposed an efficient Bi-Attention structure based on information
theory, which statistically maximizes the entropy of the representations and revives the
attention mechanism in the fully binarized BERT. Since the representations (weights,
activations, and embeddings) with extremely compressed bit-widths in a fully binarized BERT have lim-